data depth
Data Depth as a Risk
Castellanos, Arturo, Mozharovskyi, Pavlo
Data depths are score functions that quantify in an unsupervised fashion how central is a point inside a distribution, with numerous applications such as anomaly detection, multivariate or functional data analysis, arising across various fields. The halfspace depth was the first depth to aim at generalising the notion of quantile beyond the univariate case. Among the existing variety of depth definitions, it remains one of the most used notions of data depth. Taking a different angle from the quantile point of view, we show that the halfspace depth can also be regarded as the minimum loss of a set of classifiers for a specific labelling of the points. By changing the loss or the set of classifiers considered, this new angle naturally leads to a family of "loss depths", extending to well-studied classifiers such as, e.g., SVM or logistic regression, among others. This framework directly inherits computational efficiency of existing machine learning algorithms as well as their fast statistical convergence rates, and opens the data depth realm to the high-dimensional setting. Furthermore, the new loss depths highlight a connection between the dataset and the right amount of complexity or simplicity of the classifiers. The simplicity of classifiers as well as the interpretation as a risk makes our new kind of data depth easy to explain, yet efficient for anomaly detection, as is shown by experiments.
Model-Free Kernel Conformal Depth Measures Algorithm for Uncertainty Quantification in Regression Models in Separable Hilbert Spaces
Matabuena, Marcos, Ghosal, Rahul, Mozharovskyi, Pavlo, Padilla, Oscar Hernan Madrid, Onnela, Jukka-Pekka
Depth measures are powerful tools for defining level sets in emerging, non--standard, and complex random objects such as high-dimensional multivariate data, functional data, and random graphs. Despite their favorable theoretical properties, the integration of depth measures into regression modeling to provide prediction regions remains a largely underexplored area of research. To address this gap, we propose a novel, model-free uncertainty quantification algorithm based on conditional depth measures--specifically, conditional kernel mean embeddings and an integrated depth measure. These new algorithms can be used to define prediction and tolerance regions when predictors and responses are defined in separable Hilbert spaces. The use of kernel mean embeddings ensures faster convergence rates in prediction region estimation. To enhance the practical utility of the algorithms with finite samples, we also introduce a conformal prediction variant that provides marginal, non-asymptotic guarantees for the derived prediction regions. Additionally, we establish both conditional and unconditional consistency results, as well as fast convergence rates in certain homoscedastic settings. We evaluate the finite--sample performance of our model in extensive simulation studies involving various types of functional data and traditional Euclidean scenarios. Finally, we demonstrate the practical relevance of our approach through a digital health application related to physical activity, aiming to provide personalized recommendations
Fast kernel half-space depth for data with non-convex supports
Castellanos, Arturo, Mozharovskyi, Pavlo, d'Alché-Buc, Florence, Janati, Hicham
Data depth is a statistical function that generalizes order and quantiles to the multivariate setting and beyond, with applications spanning over descriptive and visual statistics, anomaly detection, testing, etc. The celebrated halfspace depth exploits data geometry via an optimization program to deliver properties of invariances, robustness, and non-parametricity. Nevertheless, it implicitly assumes convex data supports and requires exponential computational cost. To tackle distribution's multimodality, we extend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We show that the obtained depth is intuitive and establish its consistency with provable concentration bounds that allow for homogeneity testing. The proposed depth can be computed using manifold gradient making faster than halfspace depth by several orders of magnitude. The performance of our depth is demonstrated through numerical simulations as well as applications such as anomaly detection on real data and homogeneity testing.
Statistical process monitoring of artificial neural networks
Malinovskaya, Anna, Mozharovskyi, Pavlo, Otto, Philipp
The rapid advancement of models based on artificial intelligence demands innovative monitoring techniques which can operate in real time with low computational costs. In machine learning, especially if we consider artificial neural networks (ANNs), the models are often trained in a supervised manner. Consequently, the learned relationship between the input and the output must remain valid during the model's deployment. If this stationarity assumption holds, we can conclude that the ANN provides accurate predictions. Otherwise, the retraining or rebuilding of the model is required. We propose considering the latent feature representation of the data (called "embedding") generated by the ANN to determine the time when the data stream starts being nonstationary. In particular, we monitor embeddings by applying multivariate control charts based on the data depth calculation and normalized ranks. The performance of the introduced method is compared with benchmark approaches for various ANN architectures and different underlying data formats.
On Language Clustering: A Non-parametric Statistical Approach
Chattopadhyay, Anagh, Ghosh, Soumya Sankar, Karmakar, Samir
Any approach aimed at pasteurizing and quantifying a particular phenomenon must include the use of robust statistical methodologies for data analysis. With this in mind, the purpose of this study is to present statistical approaches that may be employed in nonparametric nonhomogeneous data frameworks, as well as to examine their application in the field of natural language processing and language clustering. Furthermore, this paper discusses the many uses of nonparametric approaches in linguistic data mining and processing. The data depth idea allows for the centre-outward ordering of points in any dimension, resulting in a new nonparametric multivariate statistical analysis that does not require any distributional assumptions. The concept of hierarchy is used in historical language categorisation and structuring, and it aims to organise and cluster languages into subfamilies using the same premise. In this regard, the current study presents a novel approach to language family structuring based on non-parametric approaches produced from a typological structure of words in various languages, which is then converted into a Cartesian framework using MDS. This statistical-depth-based architecture allows for the use of data-depth-based methodologies for robust outlier detection, which is extremely useful in understanding the categorization of diverse borderline languages and allows for the re-evaluation of existing classification systems. Other depth-based approaches are also applied to processes such as unsupervised and supervised clustering. This paper therefore provides an overview of procedures that can be applied to nonhomogeneous language classification systems in a nonparametric framework.
Feature Selection using e-values
Majumdar, Subhabrata, Chatterjee, Snigdhansu
Navigating an exponentially growing feature space using wrapper methods is In the context of supervised parametric models, NP-hard (Natarajan, 1995), and case-specific search strategies we introduce the concept of e-values. An e-value like k-greedy, branch-and-bound, simulated annealing is a scalar quantity that represents the proximity of are needed. Sparse penalized embedded methods can tackle the sampling distribution of parameter estimates high-dimensional data, but have inferential and algorithmic in a model trained on a subset of features to that issues, such as biased Lasso estimates (Zhang & Zhang, of the model trained on all features (i.e. the full 2014) and the use of convex relaxations to compute approximate model). Under general conditions, a rank ordering local solutions (Wang et al., 2013; Zou & Li, 2008). of e-values separates models that contain all essential features from those that do not. Feature selection in dependent data models has received comparatively lesser attention. Existing implementations of The e-values are applicable to a wide range of wrapper and embedded methods have been adapted for dependent parametric models. We use data depths and a fast data scenarios, such as mixed effect models (Meza resampling-based algorithm to implement a feature & Lahiri, 2005; Nguyen & Jiang, 2014; Peng & Lu, 2012) selection procedure using e-values, providing and spatial models (Huang et al., 2010; Lee & Ghosh, 2009).
Affine-Invariant Integrated Rank-Weighted Depth: Definition, Properties and Finite Sample Analysis
Staerman, Guillaume, Mozharovskyi, Pavlo, Clémençon, Stéphan
Because it determines a center-outward ordering of observations in $\mathbb{R}^d$ with $d\geq 2$, the concept of statistical depth permits to define quantiles and ranks for multivariate data and use them for various statistical tasks (\textit{e.g.} inference, hypothesis testing). Whereas many depth functions have been proposed \textit{ad-hoc} in the literature since the seminal contribution of \cite{Tukey75}, not all of them possess the properties desirable to emulate the notion of quantile function for univariate probability distributions. In this paper, we propose an extension of the \textit{integrated rank-weighted} statistical depth (IRW depth in abbreviated form) originally introduced in \cite{IRW}, modified in order to satisfy the property of \textit{affine-invariance}, fulfilling thus all the four key axioms listed in the nomenclature elaborated by \cite{ZuoS00a}. The variant we propose, referred to as the Affine-Invariant IRW depth (AI-IRW in short), involves the covariance/precision matrices of the (supposedly square integrable) $d$-dimensional random vector $X$ under study, in order to take into account the directions along which $X$ is most variable to assign a depth value to any point $x\in \mathbb{R}^d$. The accuracy of the sampling version of the AI-IRW depth is investigated from a nonasymptotic perspective. Namely, a concentration result for the statistical counterpart of the AI-IRW depth is proved. Beyond the theoretical analysis carried out, applications to anomaly detection are considered and numerical results are displayed, providing strong empirical evidence of the relevance of the depth function we propose here.
Depth-based pseudo-metrics between probability distributions
Staerman, Guillaume, Mozharovskyi, Pavlo, Clémençon, Stéphan, d'Alché-Buc, Florence
Data depth is a non parametric statistical tool that measures centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Consequently, its upper level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. In this work, we propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions. The first one is constructed as the Lp-distance between data depth w.r.t. each distribution while the second one relies on the Hausdorff distance between their quantile regions. It can further be seen as an original way to extend the one-dimensional formulae of the Wasserstein distance, which involves quantiles and cdfs, to the multivariate space. After discussing the properties of these pseudo-metrics and providing conditions under which they define a distance, we highlight similarities with the Wasserstein distance. Interestingly, the derived non-asymptotic bounds show that in contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality. Moreover, based on the support function of a convex body, we propose an efficient approximation possessing linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in experiments. Furthermore, by construction the regions-based pseudo-metric appears to be robust w.r.t. both outliers and heavy tails, a behavior witnessed in the numerical experiments.
The Area of the Convex Hull of Sampled Curves: a Robust Functional Statistical Depth Measure
Staerman, Guillaume, Mozharovskyi, Pavlo, Clémençon, Stephan
With the ubiquity of sensors in the IoT era, statistical observations are becoming increasingly available in the form of massive (multivariate) time-series. Formulated as unsupervised anomaly detection tasks, an abundance of applications like aviation safety management, the health monitoring of complex infrastructures or fraud detection can now rely on such functional data, acquired and stored with an ever finer granularity. The concept of statistical depth, which reflects centrality of an arbitrary observation w.r.t. a statistical population may play a crucial role in this regard, anomalies corresponding to observations with 'small' depth. Supported by sound theoretical and computational developments in the recent decades, it has proven to be extremely useful, in particular in functional spaces. However, most approaches documented in the literature consist in evaluating independently the centrality of each point forming the time series and consequently exhibit a certain insensitivity to possible shape changes. In this paper, we propose a novel notion of functional depth based on the area of the convex hull of sampled curves, capturing gradual departures from centrality, even beyond the envelope of the data, in a natural fashion. We discuss practical relevance of commonly imposed axioms on functional depths and investigate which of them are satisfied by the notion of depth we promote here. Estimation and computational issues are also addressed and various numerical experiments provide empirical evidence of the relevance of the approach proposed.
CRAD: Clustering with Robust Autocuts and Depth
Abstract--We develop a new density-based clustering algorithmclusters? The performance of CRAD is evaluated through extensive experimental studies. The number of observations in keywords-clustering, space-time processes, data depth cluster 3 is larger than that in clusters 1 and 2. The result of each algorithm is selected by searching the best clustering I. INTRODUCTION Clustering results are shown in Figure 1. Data depth methodology is a widely employed nonparametric Currently available methods such as DBCA, DBSCAN, and tool in multivariate and functional data analysis, with OPTICS, all fail to separate the cluster 1 and 2; in contrast, applications ranging from outlier detection to clustering and our new CRAD algorithm is able to detect both. Depth measures the "centrality" (or for this phenomenon is that both DBSCAN and DBCA use "outlyingness") of a given object with respect to an observed globally-defined parameters (i.e., ɛ and θ, respectively) to data cloud [4], [5].